Loading the required data…
cluster.assignments <- read.csv("Intermediate_results/regularity_of_study/regularity_based_clusters.csv")
exam.scores <- read.csv(file = "Intermediate_results/exam_scores_with_student_ids.csv")
# remove email data
exam.scores <- exam.scores %>% select(-2)
# merge exam scores and clusters
clust.and.scores <- merge(x = cluster.assignments %>% select(-cl3),
y = exam.scores %>% select(-SC_MT_TOT),
by.x = "user_id", by.y = "USER_ID",
all.x = TRUE, all.y = FALSE)
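With all.x = TRUE and all.y = FALSE, merge() performs a left join: every row of the cluster assignments is kept, and students without a matching exam record get NA scores. A toy illustration (hypothetical user_id/score values, not the project data):

```r
# Left join with merge(): all.x = TRUE keeps every row of x; unmatched rows get NA
x <- data.frame(user_id = 1:3, cl4 = c("a", "b", "a"))
y <- data.frame(USER_ID = c(1, 3), score = c(10, 30))
m <- merge(x, y, by.x = "user_id", by.y = "USER_ID",
           all.x = TRUE, all.y = FALSE)
m  # the row for user_id 2 has score NA
```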
Creating a baseline model with identified clusters as the random effect (no fixed effects)
Preparing the data for the model
lme_0_dat <- clust.and.scores %>% select(-user_id)
set.seed(seed)
lme_0 <- lmer(SC_FE_TOT ~ 1 + (1|cl4), data = lme_0_dat, REML = FALSE)
summary(lme_0)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: SC_FE_TOT ~ 1 + (1 | cl4)
Data: lme_0_dat
AIC BIC logLik deviance df.resid
3456.7 3469.2 -1725.4 3450.7 474
Scaled residuals:
Min 1Q Median 3Q Max
-2.6319 -0.7970 -0.1212 0.6746 2.2443
Random effects:
Groups Name Variance Std.Dev.
cl4 (Intercept) 22.90 4.785
Residual 78.81 8.878
Number of obs: 477, groups: cl4, 4
Fixed effects:
Estimate Std. Error t value
(Intercept) 18.159 2.434 7.461
## compute ICC
r.squaredGLMM(lme_0)
R2m R2c
0.0000000 0.2251254
22.51% of the total variance is explained by the cluster assignment
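The ICC can be verified by hand from the variance components reported in the model summary above:

```r
# ICC = between-cluster variance / (between-cluster + residual variance),
# using the rounded variance components from summary(lme_0)
var_cl4   <- 22.90
var_resid <- 78.81
icc <- var_cl4 / (var_cl4 + var_resid)
icc  # ~0.225, in line with the conditional R2 of the intercept-only model
```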
Checking if the model satisfies the assumptions for linear regression:
# assumption 1: the mean of residuals is zero
mean(resid(lme_0))
# OK
# assumption 2: homoscedasticity of residuals or equal variance
# assumption 3: Normality of residuals
check.residuals(lme_0)
check.residuals2(lme_0)
# OK
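check.residuals() and check.residuals2() are project-specific helpers; a minimal base-R equivalent of these diagnostics (shown on a toy lm fit, since the same extractor functions also work on a lmer fit) might look like:

```r
# Minimal residual diagnostics, illustrated on a toy model (base R only)
fit <- lm(dist ~ speed, data = cars)
r <- resid(fit)
mean(r)                                       # assumption 1: mean of residuals ~ 0
plot(fitted(fit), r); abline(h = 0, lty = 2)  # assumption 2: no funnel shape
qqnorm(r); qqline(r)                          # assumption 3: points near the line
```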
Use the proportions of on-topic and last-minute preparation sessions as fixed effects:
prep.sessions <- read.csv("Intermediate_results/regularity_of_study/on_topic_and_last_min_proportions.csv")
# str(prep.sessions)
lme_1_dat <- merge(x = prep.sessions %>% select(-ends_with("mad")),
y = clust.and.scores,
by = "user_id", all.x = F, all.y = T)
summary(lme_1_dat)
plot.correlations(lme_1_dat)
lme_1_dat <- lme_1_dat %>% select(-user_id)
set.seed(seed)
lme_1 <- lmer(SC_FE_TOT ~ on_topic_prop + on_topic_prop_sd + last_min_prop + last_min_prop_sd +
(1|cl4), data = lme_1_dat, REML = FALSE)
summary(lme_1)
Linear mixed model fit by maximum likelihood t-tests use Satterthwaite approximations to
degrees of freedom [lmerMod]
Formula: SC_FE_TOT ~ on_topic_prop + on_topic_prop_sd + last_min_prop +
last_min_prop_sd + (1 | cl4)
Data: lme_1_dat
AIC BIC logLik deviance df.resid
3354.2 3383.2 -1670.1 3340.2 458
Scaled residuals:
Min 1Q Median 3Q Max
-2.71757 -0.75786 -0.08334 0.77814 2.26202
Random effects:
Groups Name Variance Std.Dev.
cl4 (Intercept) 13.66 3.696
Residual 75.15 8.669
Number of obs: 465, groups: cl4, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 21.162 3.595 43.700 5.886 5.09e-07 ***
on_topic_prop 4.039 3.368 462.300 1.199 0.2310
on_topic_prop_sd -11.682 5.557 465.000 -2.102 0.0361 *
last_min_prop 2.265 5.129 461.200 0.442 0.6590
last_min_prop_sd -4.107 3.153 461.500 -1.302 0.1934
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) on_tp_ on_t__ lst_m_
on_topc_prp -0.721
on_tpc_prp_ -0.730 0.588
last_mn_prp 0.070 -0.314 0.050
lst_mn_prp_ -0.306 0.276 0.016 -0.519
Only on_topic_prop_sd is significant: a unit increase leads to a decrease in the final exam score
Compare the model with the baseline
# lme_1_base <- lmer(SC_FE_TOT ~ 1 + (1|cl4), data = lme_1_dat, REML = FALSE)
# anova(lme_1, lme_1_base)
The models cannot be compared as they are not fitted to the same number of observations: some observations were dropped from the lme_1 model due to NA values
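The mismatch arises because lm()/lmer() silently drop rows with an NA in any model variable. One way to make the comparison valid (an assumption about the intended analysis, not something done here) is to fit both models on the same complete cases via na.omit(); the mechanism in miniature:

```r
# lm()/lmer() drop NA rows only when the affected variable enters the model
d <- data.frame(y = c(1, 2, 3, 4), x = c(0.1, NA, 0.3, 0.4))
nrow(lm(y ~ 1, data = d)$model)  # 4 observations used by the intercept-only model
nrow(lm(y ~ x, data = d)$model)  # 3: the NA row is silently dropped
d_cc <- na.omit(d)               # fitting both models on d_cc makes anova() valid
```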
## compute ICC
r.squaredGLMM(lme_1)
R2m R2c
0.03849303 0.18641196
The overall model explains 18.64% of the variability in the final exam score; only a small portion (3.85%) of this variability is explained by the fixed factors
Checking if model assumptions hold
# if residuals are normally distributed with constant standard deviation
check.residuals(lme_1)
# check for multicollinearity
max(vif.mer(lme_1))
The assumptions hold
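vif.mer() is a project-specific helper; the underlying idea (regress each predictor on the others and take VIF_j = 1 / (1 - R²_j)) can be sketched in base R on toy data:

```r
# VIF by hand: one auxiliary regression per predictor (toy data, base R only)
vif_manual <- function(df) {
  sapply(names(df), function(j) {
    r2 <- summary(lm(reformulate(setdiff(names(df), j), response = j),
                     data = df))$r.squared
    1 / (1 - r2)
  })
}
set.seed(1)
d <- data.frame(a = rnorm(200), b = rnorm(200))
d$c <- d$a + rnorm(200, sd = 0.5)  # c is strongly collinear with a
round(vif_manual(d), 2)            # a and c well above 1; b stays close to 1
```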
Loading the data
weekday.sessions <- read.csv("Intermediate_results/regularity_of_study/weekday_session_props.csv")
# str(weekday.sessions)
lme_2_dat <- merge(x = weekday.sessions %>% select(1:8, 11),
y = clust.and.scores,
by = "user_id", all.x = FALSE, all.y = TRUE)
lme_2_dat <- lme_2_dat %>% select(-user_id)
#summary(lme_2_dat)
# since the count variables are on a very different scale than the entropy, standardize them
lme_2_st_dat <- scale.features(lme_2_dat)
#summary(lme_2_st_dat)
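scale.features() is a project-specific helper; assuming it z-scores all numeric columns except the cluster variable (the magnitude of the lme_2 deviance suggests the outcome was standardized as well), a minimal equivalent would be:

```r
# Hypothetical minimal equivalent of scale.features(): z-score numeric columns,
# leaving the cluster variable untouched (an assumption about the helper)
scale_features <- function(df, keep = "cl4") {
  num <- setdiff(names(df)[sapply(df, is.numeric)], keep)
  df[num] <- lapply(df[num], function(x) as.numeric(scale(x)))
  df
}
d <- data.frame(Mon_count = c(10, 20, 30), cl4 = c(1, 1, 2),
                SC_FE_TOT = c(5, 6, 7))
d_st <- scale_features(d)
d_st$Mon_count  # -1 0 1: standardized to mean 0, sd 1
```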
set.seed(seed)
lme_2 <- lmer(SC_FE_TOT ~ Sun_count + Mon_count + Tue_count + Wed_count + Thu_count + Fri_count +
Sat_count + weekday_entropy + (1|cl4), data = lme_2_st_dat, REML = FALSE)
summary(lme_2)
Linear mixed model fit by maximum likelihood t-tests use Satterthwaite approximations to
degrees of freedom [lmerMod]
Formula: SC_FE_TOT ~ Sun_count + Mon_count + Tue_count + Wed_count + Thu_count +
Fri_count + Sat_count + weekday_entropy + (1 | cl4)
Data: lme_2_st_dat
AIC BIC logLik deviance df.resid
842.8 888.7 -410.4 820.8 466
Scaled residuals:
Min 1Q Median 3Q Max
-3.6402 -0.7160 -0.1153 0.7892 2.3632
Random effects:
Groups Name Variance Std.Dev.
cl4 (Intercept) 0.0000 0.000
Residual 0.3272 0.572
Number of obs: 477, groups: cl4, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.007812 0.032484 477.000000 0.240 0.81005
Sun_count 0.039472 0.045232 477.000000 0.873 0.38329
Mon_count 0.176776 0.041865 477.000000 4.222 2.89e-05 ***
Tue_count 0.132384 0.041767 477.000000 3.170 0.00162 **
Wed_count 0.100672 0.041470 477.000000 2.428 0.01557 *
Thu_count 0.139886 0.035479 477.000000 3.943 9.26e-05 ***
Fri_count 0.048541 0.040518 477.000000 1.198 0.23151
Sat_count -0.006031 0.033196 477.000000 -0.182 0.85592
weekday_entropy 0.083886 0.032368 477.000000 2.592 0.00985 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) Sn_cnt Mn_cnt Tu_cnt Wd_cnt Th_cnt Fr_cnt St_cnt
Sun_count -0.246
Mon_count -0.226 -0.335
Tue_count -0.073 0.005 0.000
Wed_count -0.254 -0.048 0.274 -0.045
Thu_count -0.013 -0.140 0.018 -0.024 -0.226
Fri_count 0.012 -0.002 -0.129 -0.104 -0.030 -0.274
Sat_count -0.109 -0.319 0.025 -0.127 -0.138 0.036 -0.232
wekdy_ntrpy 0.432 -0.228 -0.134 -0.038 -0.117 -0.013 -0.143 -0.204
As in the regular linear model (Model 5), the Mon, Tue, Wed, and Thu session counts are significant, as is the weekday entropy.
Compare the model with the baseline
lme_2_base <- lmer(SC_FE_TOT ~ 1 + (1|cl4), data = lme_2_st_dat, REML = FALSE)
anova(lme_2, lme_2_base)
Data: lme_2_st_dat
Models:
..1: SC_FE_TOT ~ 1 + (1 | cl4)
object: SC_FE_TOT ~ Sun_count + Mon_count + Tue_count + Wed_count + Thu_count +
object: Fri_count + Sat_count + weekday_entropy + (1 | cl4)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
..1 3 873.25 885.75 -433.62 867.25
object 11 842.82 888.66 -410.41 820.82 46.428 8 1.971e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model 2 is significantly better than the baseline.
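The chi-square statistic reported by anova() can be reproduced by hand from the deviances and parameter counts in the table above:

```r
# Likelihood-ratio test from the reported deviances (lme_2 vs baseline)
dev_base <- 867.25   # deviance of the intercept-only model (Df = 3)
dev_full <- 820.82   # deviance of lme_2 (Df = 11)
chisq <- dev_base - dev_full                        # 46.43
p <- pchisq(chisq, df = 11 - 3, lower.tail = FALSE)
p  # on the order of 2e-07, matching Pr(>Chisq) in the anova table
```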
## compute ICC
r.squaredGLMM(lme_2)
R2m R2c
0.231116 0.231116
Since the marginal and conditional R2 are exactly the same, all of the explained variability is attributable to the fixed factors; the random factor does not contribute (its variance is estimated as zero in the model summary).
Checking if residuals are normally distributed with constant standard deviation
check.residuals(lme_2)
check.residuals2(lme_2)
Not fully fine, but also not too bad
Check for multicollinearity
max(vif.mer(lme_2))
[1] 1.755446
It’s OK
Loading the data…
res.use.stats <- read.csv("Intermediate_results/regularity_of_study/daily_resource_use_statistics_w2-5_7-12.csv")
lme_3_dat <- merge(res.use.stats, clust.and.scores, by = "user_id", all.x = F, all.y = T)
lme_3_dat <- lme_3_dat %>% select(-user_id)
lme_3.1_dat <- lme_3_dat %>% select(starts_with("tot"), cl4, SC_FE_TOT)
plot.correlations(lme_3.1_dat)
set.seed(seed)
lme_3.1 <- lmer(SC_FE_TOT ~ tot_video_cnt + tot_exe_cnt + tot_mcq_cnt + tot_mcog_cnt +
tot_res_cnt + (1|cl4), data = lme_3.1_dat, REML = FALSE)
summary(lme_3.1)
Linear mixed model fit by maximum likelihood t-tests use Satterthwaite approximations to
degrees of freedom [lmerMod]
Formula: SC_FE_TOT ~ tot_video_cnt + tot_exe_cnt + tot_mcq_cnt + tot_mcog_cnt +
tot_res_cnt + (1 | cl4)
Data: lme_3.1_dat
AIC BIC logLik deviance df.resid
3419.2 3452.6 -1701.6 3403.2 469
Scaled residuals:
Min 1Q Median 3Q Max
-2.82298 -0.77938 -0.01984 0.75193 2.82197
Random effects:
Groups Name Variance Std.Dev.
cl4 (Intercept) 19.45 4.410
Residual 71.38 8.449
Number of obs: 477, groups: cl4, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.198e+01 2.521e+00 5.900e+00 8.718 0.000141 ***
tot_video_cnt 3.904e-04 8.399e-04 4.751e+02 0.465 0.642307
tot_exe_cnt -7.283e-03 1.131e-03 4.743e+02 -6.438 2.96e-10 ***
tot_mcq_cnt 5.216e-03 3.407e-03 4.736e+02 1.531 0.126447
tot_mcog_cnt 6.081e-03 1.225e-02 4.729e+02 0.496 0.619819
tot_res_cnt 5.105e-03 1.924e-03 4.767e+02 2.654 0.008216 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) tt_vd_ tt_x_c tt_mcq_ tt_mcg_
tot_vid_cnt -0.035
tot_exe_cnt -0.350 -0.051
tot_mcq_cnt 0.025 -0.193 -0.065
tot_mcg_cnt -0.044 -0.019 0.056 -0.211
tot_res_cnt -0.142 -0.085 -0.225 -0.372 -0.187
Predictors with a significant effect: tot_exe_cnt (negative) and tot_res_cnt (positive)
Compare the model with the baseline
lme_3.1_base <- lmer(SC_FE_TOT ~ 1 + (1|cl4), data = lme_3.1_dat, REML = FALSE)
anova(lme_3.1, lme_3.1_base)
Data: lme_3.1_dat
Models:
..1: SC_FE_TOT ~ 1 + (1 | cl4)
object: SC_FE_TOT ~ tot_video_cnt + tot_exe_cnt + tot_mcq_cnt + tot_mcog_cnt +
object: tot_res_cnt + (1 | cl4)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
..1 3 3456.7 3469.2 -1725.4 3450.7
object 8 3419.2 3452.6 -1701.6 3403.2 47.501 5 4.49e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The new model (lme_3.1) is significantly better than the baseline.
r.squaredGLMM(lme_3.1)
R2m R2c
0.08230082 0.27882260
The overall model explains 27.88% of variability in the final exam score; fixed effects explain only 8.23%
Checking if model assumptions are satisfied
# if residuals are normally distributed and if equality of variance holds
check.residuals(lme_3.1)
check.residuals2(lme_3.1)
# check for multicollinearity
max(vif.mer(lme_3.1))
It is questionable whether the equal-variance assumption holds; the other assumptions are satisfied.
lme_3.2_dat <- lme_3_dat %>% select(starts_with("tot"), starts_with("mad"), cl4, SC_FE_TOT)
plot.correlations(lme_3.2_dat)
# remove mad_res_cnt as highly correlated with tot_res_cnt
lme_3.2_dat <- lme_3.2_dat %>% select(-mad_res_cnt)
# summary(lme_3.2_dat)
# some variables have very different scales - need to be rescaled
lme_3.2_st_dat <- scale.features(lme_3.2_dat %>% select(-c(cl4, SC_FE_TOT)))
# summary(lme_3.2_st_dat)
# when rescaled, almost all regularity indicator (MAD) values become zero
Not applicable: when rescaled, the MAD values become zero, due to their highly irregular distribution with numerous outliers
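The degenerate rescaling can be seen with a small example: when more than half of a variable's values equal its median (typical of the mad_* indicators here, which are mostly zero with a few large outliers), its MAD is zero, so any median/MAD-based rescaling maps the bulk of the values to zero. This is an assumption about how scale.features behaves on such variables; toy data:

```r
# When >50% of values are zero, both the median and the MAD are zero,
# so robust (median/MAD) rescaling collapses the variable
set.seed(1)
x <- c(rep(0, 60), rpois(40, lambda = 3))  # 60% zeros plus some outlier-ish counts
median(x)  # 0
mad(x)     # 0
```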
Loading the data…
topic.stats <- read.csv("Intermediate_results/regularity_of_study/topic_counts_statistics_w2-5_7-12.csv")
lme_4_dat <- merge(topic.stats, clust.and.scores, by = "user_id", all.x = F, all.y = T)
lme_4_dat <- lme_4_dat %>% select(-user_id)
Note: initially, I wanted to use the numbers of days (with each topic focus), but the X_days variables are highly mutually correlated
lme_4.1_dat <- lme_4_dat %>% select(ends_with("prop"), cl4, SC_FE_TOT)
plot.correlations(lme_4.1_dat)
# remove orient_prop as highly correlated with metacog_prop; also prj_prop as having zero correlation with the exam score
lme_4.1_dat <- lme_4.1_dat %>% select(-c(prj_prop, orient_prop))
set.seed(seed)
lme_4.1 <- lmer(SC_FE_TOT ~ ontopic_prop + revisit_prop + metacog_prop + (1|cl4),
data = lme_4.1_dat, REML = FALSE)
summary(lme_4.1)
Linear mixed model fit by maximum likelihood t-tests use Satterthwaite approximations to
degrees of freedom [lmerMod]
Formula: SC_FE_TOT ~ ontopic_prop + revisit_prop + metacog_prop + (1 | cl4)
Data: lme_4.1_dat
AIC BIC logLik deviance df.resid
3451.1 3476.1 -1719.6 3439.1 471
Scaled residuals:
Min 1Q Median 3Q Max
-2.36012 -0.74169 -0.08928 0.78733 2.30709
Random effects:
Groups Name Variance Std.Dev.
cl4 (Intercept) 20.62 4.541
Residual 76.97 8.773
Number of obs: 477, groups: cl4, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 31.5711 10.2162 384.2000 3.090 0.00215 **
ontopic_prop 7.4762 2.9122 473.2000 2.567 0.01056 *
revisit_prop -0.8163 2.6449 473.1000 -0.309 0.75772
metacog_prop -17.5504 9.8078 474.1000 -1.789 0.07418 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) ontpc_ rvst_p
ontopic_prp -0.257
revisit_prp -0.226 0.325
metacog_prp -0.951 0.077 0.077
The only significant fixed effect is ontopic_prop - the proportion of active days when a student was preparing for the week’s lecture
r.squaredGLMM(lme_4.1)
R2m R2c
0.01990326 0.22697643
The overall model explains 22.7% of the variance in the final exam score; the fixed effects account for only about 2%
Compare the model with the baseline
lme_4.1_base <- lmer(SC_FE_TOT ~ 1 + (1|cl4), data = lme_4.1_dat, REML = FALSE)
anova(lme_4.1, lme_4.1_base)
Data: lme_4.1_dat
Models:
..1: SC_FE_TOT ~ 1 + (1 | cl4)
object: SC_FE_TOT ~ ontopic_prop + revisit_prop + metacog_prop + (1 |
object: cl4)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
..1 3 3456.7 3469.2 -1725.4 3450.7
object 6 3451.1 3476.1 -1719.6 3439.1 11.617 3 0.008816 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The new model (lme_4.1) is significantly better than the baseline.
lme_4.2_dat <- lme_4_dat %>% select(starts_with("tot"), cl4, SC_FE_TOT)
plot.correlations(lme_4.2_dat)
# remove tot_metacog_cnt as highly correlated with several other variables; also, tot_revisit_cnt has zero correlation with the final exam score
lme_4.2_dat <- lme_4.2_dat %>% select(-c(tot_metacog_cnt, tot_revisit_cnt))
set.seed(seed)
lme_4.2 <- lmer(SC_FE_TOT ~ tot_ontopic_cnt + tot_orient_cnt + tot_prj_cnt + (1|cl4),
data = lme_4.2_dat, REML = FALSE)
summary(lme_4.2)
Linear mixed model fit by maximum likelihood t-tests use Satterthwaite approximations to
degrees of freedom [lmerMod]
Formula: SC_FE_TOT ~ tot_ontopic_cnt + tot_orient_cnt + tot_prj_cnt + (1 | cl4)
Data: lme_4.2_dat
AIC BIC logLik deviance df.resid
3453.0 3478.0 -1720.5 3441.0 471
Scaled residuals:
Min 1Q Median 3Q Max
-2.49365 -0.79573 -0.07265 0.72290 2.30886
Random effects:
Groups Name Variance Std.Dev.
cl4 (Intercept) 14.59 3.820
Residual 77.50 8.803
Number of obs: 477, groups: cl4, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.610e+01 2.141e+00 5.100e+00 7.517 0.000598 ***
tot_ontopic_cnt 1.846e-03 9.409e-04 4.769e+02 1.962 0.050356 .
tot_orient_cnt 7.955e-03 4.402e-03 4.769e+02 1.807 0.071400 .
tot_prj_cnt -2.110e-02 1.457e-02 4.728e+02 -1.448 0.148255
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) tt_nt_ tt_rn_
tt_ntpc_cnt -0.196
tot_rnt_cnt -0.167 -0.303
tot_prj_cnt -0.065 0.087 -0.633
r.squaredGLMM(lme_4.2)
R2m R2c
0.02603713 0.18035337
Very poor model…
lme_4.3_dat <- lme_4_dat %>% select(starts_with("mad"), cl4, SC_FE_TOT)
plot.correlations(lme_4.3_dat)
Better not to: the mad_X_cnt variables have very low correlation with the final exam score, so a significant fixed effect cannot be expected.
Indicators are computed at the week level, based on the following principle: a score of one is given to a student (for a given week) if he/she used a certain kind of resource (e.g., video) more than the average (median) use of that resource type in the given week
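The principle above can be sketched as follows (toy data; the column and variable names are hypothetical, not the project's actual code):

```r
# Weekly above-median indicator: 1 if a student's weekly count of a resource
# type exceeds that week's median, then summed over weeks per student
set.seed(42)
weekly <- data.frame(
  user_id   = rep(1:6, times = 4),
  week      = rep(1:4, each = 6),
  video_cnt = rpois(24, lambda = 5)
)
weekly$above <- with(weekly,
  ave(video_cnt, week, FUN = function(x) as.integer(x > median(x))))
VIDEO_ind <- tapply(weekly$above, weekly$user_id, sum)  # per-student score, 0..4
```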
Loading the data
res.use.ind <- read.csv("Intermediate_results/regularity_of_study/res_use_indicators_w2-13.csv")
str(res.use.ind)
lme_5_dat <- merge(x = res.use.ind, y = clust.and.scores,
by = "user_id", all.x = FALSE, all.y = TRUE)
lme_5_dat <- lme_5_dat %>% select(-user_id)
summary(lme_5_dat)
plot.correlations(lme_5_dat)
# VIDEO_ind and MCQ_ind are highly correlated, remove one of them
lme_5_dat <- lme_5_dat %>% select(-VIDEO_ind)
set.seed(seed)
lme_5 <- lmer(SC_FE_TOT ~ MCQ_ind + EXE_ind + RES_ind + METACOG_ind + (1|cl4),
data = lme_5_dat, REML = FALSE)
summary(lme_5)
Linear mixed model fit by maximum likelihood t-tests use Satterthwaite approximations to
degrees of freedom [lmerMod]
Formula: SC_FE_TOT ~ MCQ_ind + EXE_ind + RES_ind + METACOG_ind + (1 | cl4)
Data: lme_5_dat
AIC BIC logLik deviance df.resid
3376.0 3405.2 -1681.0 3362.0 470
Scaled residuals:
Min 1Q Median 3Q Max
-3.07989 -0.71301 -0.01201 0.76355 2.59049
Random effects:
Groups Name Variance Std.Dev.
cl4 (Intercept) 10.12 3.181
Residual 65.78 8.110
Number of obs: 477, groups: cl4, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 18.6868 1.9966 7.0000 9.360 3.28e-05 ***
MCQ_ind 0.7786 0.1651 477.0000 4.715 3.18e-06 ***
EXE_ind -0.9767 0.1415 476.7000 -6.902 1.65e-11 ***
RES_ind 0.3550 0.1575 466.7000 2.254 0.0247 *
METACOG_ind -0.0258 0.1527 473.2000 -0.169 0.8659
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) MCQ_nd EXE_nd RES_nd
MCQ_ind -0.124
EXE_ind -0.384 0.089
RES_ind -0.220 -0.387 -0.142
METACOG_ind -0.130 -0.389 0.031 -0.016
Predictors with a significant effect: MCQ_ind (positive), EXE_ind (negative), and RES_ind (positive)
Compare the model with the baseline
lme_5_base <- lmer(SC_FE_TOT ~ 1 + (1|cl4), data = lme_5_dat, REML = FALSE)
anova(lme_5, lme_5_base)
Data: lme_5_dat
Models:
..1: SC_FE_TOT ~ 1 + (1 | cl4)
object: SC_FE_TOT ~ MCQ_ind + EXE_ind + RES_ind + METACOG_ind + (1 |
object: cl4)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
..1 3 3456.7 3469.2 -1725.4 3450.7
object 7 3376.0 3405.2 -1681.0 3362.0 88.69 4 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model is significantly better than the baseline.
r.squaredGLMM(lme_5)
R2m R2c
0.1835686 0.2924003
The best model so far: it explains 29.24% of the variability in the final exam score; of that, 18.36% is explained by the fixed factors.
Checking if model assumptions are satisfied
# if residuals are normally distributed and if equality of variance holds
check.residuals(lme_5)
check.residuals2(lme_5)
# check for multicollinearity
max(vif.mer(lme_5))
It can be said that the assumptions hold
Indicators are computed at the week level, based on the following principle: a score of one is given to a student (for a given week) if his/her number of events related to a particular topic type (e.g., revisiting) was above the average (median) number of events for that topic type in the given week
Weeks 6 and 13 are excluded from these computations, as during these weeks one can expect different behavioral patterns than usual.
Loading the data
topic.ind <- read.csv("Intermediate_results/regularity_of_study/topic_based_indicators_w2-5_7-12.csv")
str(topic.ind)
lme_6_dat <- merge(x = topic.ind, y = clust.and.scores,
by = "user_id", all.x = FALSE, all.y = TRUE)
lme_6_dat <- lme_6_dat %>% select(-user_id)
summary(lme_6_dat)
plot.correlations(lme_6_dat)
# orient_ind and metacog_ind are highly correlated, remove one of them
lme_6_dat <- lme_6_dat %>% select(-orient_ind)
set.seed(seed)
lme_6 <- lmer(SC_FE_TOT ~ ontopic_ind + revisit_ind + metacog_ind + prj_ind + (1|cl4),
data = lme_6_dat, REML = FALSE)
summary(lme_6)
Linear mixed model fit by maximum likelihood t-tests use Satterthwaite approximations to
degrees of freedom [lmerMod]
Formula: SC_FE_TOT ~ ontopic_ind + revisit_ind + metacog_ind + prj_ind + (1 | cl4)
Data: lme_6_dat
AIC BIC logLik deviance df.resid
3446.8 3476.0 -1716.4 3432.8 470
Scaled residuals:
Min 1Q Median 3Q Max
-2.43718 -0.77507 -0.05424 0.76676 2.52504
Random effects:
Groups Name Variance Std.Dev.
cl4 (Intercept) 14.40 3.795
Residual 76.17 8.728
Number of obs: 477, groups: cl4, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 16.92282 2.26804 6.20000 7.461 0.000255 ***
ontopic_ind 0.51851 0.17969 470.10000 2.886 0.004087 **
revisit_ind -0.44263 0.16810 475.60000 -2.633 0.008735 **
metacog_ind 0.19592 0.20887 476.70000 0.938 0.348728
prj_ind 0.03693 0.30348 474.30000 0.122 0.903203
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) ontpc_ rvst_n mtcg_n
ontopic_ind -0.284
revisit_ind -0.260 0.109
metacog_ind -0.044 -0.269 -0.304
prj_ind -0.171 0.013 0.035 -0.473
Significant fixed effects: ontopic_ind (positive) and revisit_ind (negative)
Compare the model with the baseline
lme_6_base <- lmer(SC_FE_TOT ~ 1 + (1|cl4), data = lme_6_dat, REML = FALSE)
anova(lme_6, lme_6_base)
Data: lme_6_dat
Models:
..1: SC_FE_TOT ~ 1 + (1 | cl4)
object: SC_FE_TOT ~ ontopic_ind + revisit_ind + metacog_ind + prj_ind +
object: (1 | cl4)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
..1 3 3456.7 3469.2 -1725.4 3450.7
object 7 3446.8 3476.0 -1716.4 3432.8 17.912 4 0.001284 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model is significantly better than the baseline.
r.squaredGLMM(lme_6)
R2m R2c
0.04183826 0.19421923
The model explains 19.42% of the variability in the final exam score; of that, 4.18% is explained by the fixed factors.
Checking if model assumptions are satisfied
# if residuals are normally distributed and if equality of variance holds
check.residuals(lme_6)
#check.residuals2(lme_6)
# check for multicollinearity
max(vif.mer(lme_6))
It’s fine - the assumptions hold.
Loading the data
reg.ind <- read.csv("Intermediate_results/regularity_of_study/gaps_between_consecutive_logins_w2-13.csv")
str(reg.ind)
lme_7_dat <- merge(x = reg.ind %>% select(user_id, median_gap), y = clust.and.scores,
by = "user_id", all.x = FALSE, all.y = TRUE)
lme_7_dat <- lme_7_dat %>% select(-user_id)
summary(lme_7_dat)
plot.correlations(lme_7_dat)
set.seed(seed)
lme_7 <- lmer(SC_FE_TOT ~ median_gap + (1|cl4),
data = lme_7_dat, REML = FALSE)
summary(lme_7)
Linear mixed model fit by maximum likelihood t-tests use Satterthwaite approximations to
degrees of freedom [lmerMod]
Formula: SC_FE_TOT ~ median_gap + (1 | cl4)
Data: lme_7_dat
AIC BIC logLik deviance df.resid
3452.3 3469.0 -1722.2 3444.3 473
Scaled residuals:
Min 1Q Median 3Q Max
-2.6382 -0.7942 -0.1186 0.6717 2.2550
Random effects:
Groups Name Variance Std.Dev.
cl4 (Intercept) 11.61 3.408
Residual 78.18 8.842
Number of obs: 477, groups: cl4, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 19.4150 1.8147 3.6700 10.698 0.000671 ***
median_gap -1.0121 0.3716 270.8400 -2.724 0.006879 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
median_gap -0.243
The median gap, measured in days, is significant: a one-day increase in this predictor leads to a decrease of about 1.01 points in the student’s final exam score.
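Using the fixed-effect estimates above, the population-level prediction (ignoring the cluster random effect) can be written out directly:

```r
# Fixed-effects-only prediction from the lme_7 estimates reported above
predict_score <- function(gap_days) 19.4150 - 1.0121 * gap_days
predict_score(c(0, 1, 7))  # each extra day of median gap costs about 1 point
```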
r.squaredGLMM(lme_7)
R2m R2c
0.02499718 0.15107312
The model explains 15.11% of the variability in the final exam score; of that, only 2.5% is explained by the fixed factors.
Checking if model assumptions are satisfied
# if residuals are normally distributed and if equality of variance holds
check.residuals(lme_7)
#check.residuals2(lme_7)
# check for multicolinearity
max(vif.mer(lme_7))
It’s fine - the assumptions hold.